Scientific Data
Springer Science and Business Media LLC
All preprints, ranked by how well they match Scientific Data's content profile, based on 174 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Pedro A. Valdes-Sosa; Jorge F. Bosch-Bayard; Lidice Galan Garcia; Maria L Bringas Vega; Eduardo Aubert Vazquez; Samir Das; Trinidad Virues Alba; Cecile Madjar; Zia Mohades; Leigh C. MacIntyre; Chrystine Rogers; Shawn Brown; Lourdes Valdes Urrutia; Iris Rodriguez Gil; Alan C. Evans; Mitchell J. Valdes Sosa
The Cuban Human Brain Mapping Project (CHBMP) repository is an open multimodal neuroimaging and cognitive dataset from 282 healthy participants, age range 18 to 68 years (mean 31.9, SD 9.3 years). This dataset was acquired from 2004 to 2008 as a subset of a larger stratified random sample of 2,019 participants from La Lisa municipality in La Habana, Cuba. Exclusion criteria included the presence of disease or brain dysfunction. The information made available for all participants comprises: high-density (64-120 channels) resting state electroencephalograms (EEG), magnetic resonance images (MRI), psychological tests (MMSE, Wechsler Adult Intelligence Scale WAIS III, computerized reaction time tests using a go no-go paradigm), as well as general information (age, gender, education, ethnicity, handedness and weight). The EEG data contain recordings of at least 30 minutes' duration including the following conditions: eyes closed, eyes open, hyperventilation and subsequent recovery. The MRI consisted of anatomical T1 and T2 as well as diffusion weighted (DWI) images acquired on a 1.5 Tesla system. The data are available for registered users on the LORIS database, which is part of the MNI neuroinformatics ecosystem. Competing Interest Statement: The authors have declared no competing interest.
Levchenko, E.; Chow-Wing-Bom, H.; Dick, F.; Tierney, A. T.; Skipper, J. I.
We provide a multimodal naturalistic neuroimaging database (NNDb-3T+), designed to support the study of brain function under both naturalistic and controlled experimental conditions. The database includes high-quality 3T fMRI data from 40 participants acquired during full-length movie-watching and three sensory mapping tasks: somatotopy, retinotopy, and tonotopy. Each participant also completed synchronised eye-tracking during movie-watching and retinotopy, physiological recordings, and a battery of behavioural and cognitive assessments. Data were collected across two MRI sessions and a remote testing session, with all data organised in a BIDS-compliant format. Technical validation confirms high data quality, with minimal head motion, accurate eye-tracker calibration, and robust task-evoked activation patterns. The database provides a unique resource for investigating individual differences, functional topographies, multimodal integration, and naturalistic cognition. All raw and preprocessed data, quality metrics, and preprocessing scripts are publicly available to support reproducible research.
Gong, R.; Ichinohe, N.; Abe, H.; Tani, T.; Lin, M.; Okuno, T.; Nakae, K.; Hata, J.; Ishii, S.; Delmas, P.; Heidari, S.; Wang, J.; Yamamori, T.; Okano, H.; Woodward, A.
We present our new Brain/MINDS 3D digital marmoset brain atlas version 2.0 (BMA2.0), a population-based 3D digital brain atlas of the common marmoset (Callithrix jacchus), designed to overcome the limitations of previous single subject atlases that are prone to structural biases arising from individual variation. Here, manually delineated cortical regions from 10 myelin-stained brains were used to create a generalized cortical parcellation. Newly refined subcortical regions from a previous atlas and a completely new cerebellum parcellation were also incorporated, resulting in a comprehensive whole brain parcellation for both hemispheres. To facilitate multimodal data analysis, the atlas package includes co-registered average templates for myelin and Nissl staining from the same individuals, ex vivo MRI T2 (91 individuals), and in vivo MRI T2 (446 individuals). Cortical flat maps and pial, cortical mid-thickness, and white matter surfaces are also provided. BMA2.0 provides a central brain space for multimodal data integration, spatial analysis, and comparative neuroscience. Standard formats and transformations are provided for easy integration into existing workflows and interoperability with existing atlases.
Nastase, S. A.; Liu, Y.-F.; Hillman, H.; Zadbood, A.; Hasenfratz, L.; Keshavarzian, N.; Chen, J.; Honey, C. J.; Yeshurun, Y.; Regev, M.; Nguyen, M.; Chang, C. H. C.; Baldassano, C.; Lositsky, O.; Simony, E.; Chow, M. A.; Leong, Y. C.; Brooks, P. P.; Micciche, E.; Choe, G.; Goldstein, A.; Vanderwal, T.; Halchenko, Y. O.; Norman, K. A.; Hasson, U.
The "Narratives" collection aggregates a variety of functional MRI datasets collected while human subjects listened to naturalistic spoken stories. The current release includes 345 subjects, 891 functional scans, and 27 diverse stories of varying duration totaling ~4.6 hours of unique stimuli (~43,000 words). This data collection is well-suited for naturalistic neuroimaging analysis, and is intended to serve as a benchmark for models of language and narrative comprehension. We provide standardized MRI data accompanied by rich metadata, preprocessed versions of the data ready for immediate use, and the spoken story stimuli with time-stamped phoneme- and word-level transcripts. All code and data are publicly available with full provenance in keeping with current best practices in transparent and reproducible neuroimaging.
Branco, V. V.; Cardoso, P.; Correia, L.
Motivation: SPECTRE is an open-source database containing standardised spatial data on global environmental and anthropogenic variables that are potential threats to terrestrial species and ecosystems. Its goal is to allow users to swiftly access spatial data on multiple threats at a resolution of 30 arc-seconds for all terrestrial areas. Following the standard set by Worldclim, this data allows full comparability and ease of use under common statistical frameworks for global change studies, species distribution modelling, threat assessments, quantification of ecosystem services and disturbance, among multiple other uses. A web user interface, a persistent online repository, and an accompanying R package with functions for downloading and manipulating data are provided. Main types of variable contained: SPECTRE is a GIS product with 24 geoTiff raster layers (with plans to expand in the near future) with an approximate 1 km² resolution.
Bandrowski, A.; Grethe, J. S.; Pilko, A.; Gillespie, T. H.; Pine, G.; Patel, B.; Surles-Zeiglera, M.; Martone, M. E.
The NIH Common Fund's Stimulating Peripheral Activity to Relieve Conditions (SPARC) initiative is a large-scale program that seeks to accelerate the development of therapeutic devices that modulate electrical activity in nerves to improve organ function. Integral to the SPARC program are the rich anatomical and functional datasets produced by investigators across the SPARC consortium that provide key details about organ-specific circuitry, including structural and functional connectivity, mapping of cell types and molecular profiling. These datasets are provided to the research community through an open data platform, the SPARC Portal. To ensure SPARC datasets are Findable, Accessible, Interoperable and Reusable (FAIR), they are all submitted to the SPARC portal following a standard scheme established by the SPARC Curation Team, called the SPARC Data Structure (SDS). Inspired by the Brain Imaging Data Structure (BIDS), the SDS has been designed to capture the large variety of data generated by SPARC investigators, who come from all fields of biomedical research. Here we present the rationale and design of the SDS, including a description of the SPARC curation process and the automated tools for complying with the SDS, including the SDS validator and Software to Organize Data Automatically (SODA) for SPARC. The objective is to provide detailed guidelines for anyone desiring to comply with the SDS. Since the SDS is suitable for any type of biomedical research data, it can be adopted by any group desiring to follow the FAIR data principles for managing their data, even outside of the SPARC consortium. Finally, this manuscript provides a foundational framework that can be used by any organization desiring either to adapt the SDS to suit the specific needs of their data or to design their own FAIR data sharing scheme from scratch.
LaMontagne, P. J.; Benzinger, T. L. S.; Morris, J. C.; Keefe, S.; Hornbeck, R.; Xiong, C.; Grant, E.; Hassenstab, J.; Moulder, K.; Vlassenko, A.; Raichle, M. E.; Cruchaga, C.; Marcus, D.
OASIS-3 is a compilation of MRI and PET imaging and related clinical data for 1098 participants, collected across several ongoing studies in the Washington University Knight Alzheimer Disease Research Center over the course of 15 years. Participants include 605 cognitively normal adults and 493 individuals at various stages of cognitive decline ranging in age from 42 to 95 years. The OASIS-3 dataset contains over 2000 MR sessions, including multiple structural and functional sequences. PET metabolic and amyloid imaging includes over 1500 raw imaging scans; the accompanying post-processed files from the PET Unified Pipeline (PUP) are also available in OASIS-3. OASIS-3 also contains post-processed imaging data such as volumetric segmentations and PET analyses. Imaging data are accompanied by dementia and APOE status and longitudinal clinical and cognitive outcomes. OASIS-3 is available as an open access data set to the scientific community to answer questions related to healthy aging and dementia.
Jansen, J.; Shelamoff, V.; Gros, C.; Windsor, T.; Hill, N. A.; Barnes, D. K.; Bowden, D. A.; Gutt, J.; Bax, N.; Downey, R.; Eleaume, M.; Post, A. L.; Griffiths, H. J.; Linse, K.; Piepenburg, D.; Purser, A.; Smith, C. R.; Ziegler, A. F.; Johnson, C. R.
Marine imagery is a comparatively cost-effective way to collect data on seafloor organisms, biodiversity and habitat morphology. However, annotating these images to extract detailed biological information is time-consuming and expensive, and reference libraries of consistently annotated seafloor images are rarely publicly available. Here, we present the Antarctic Seafloor Annotated Imagery Database (AS-AID), a result of a multinational collaboration to collate and annotate regional seafloor imagery datasets from 19 Antarctic research cruises between 1985 and 2019. AS-AID comprises 3,599 georeferenced downward-facing seafloor images that have been labelled with a total of 615,051 expert annotations. Annotations are based on the CATAMI (Collaborative and Automated Tools for Analysis of Marine Imagery) classification scheme and have been reviewed by experts. In addition, because the pixel location of each annotation within each image is available, annotations can be viewed easily and customised to suit individual research priorities. This dataset can be used to investigate species distributions and community patterns, provides a reference for assessing change through time, and can be used to train algorithms to automatically detect and annotate marine fauna.
Kondo, M.; Sehara, K.; Harukuni, R.; Aoki, R.; Sugimoto, S.; Tanaka, Y. R.; Matsuzaki, M.; Nakae, K.
The BraiDyn-BC (Brain Dynamics underlying emergence of Behavioral Change) Database offers an extensive, multimodal dataset that links wide-field calcium imaging of the mouse neocortex to comprehensive behavioral measurements during a behavioral task. As part of this database, we provide a new dataset that includes 15 sessions spanning two weeks of motor skill learning, in which 25 mice were trained to pull a lever to obtain water rewards. Simultaneous high-speed videography captures body, facial, and eye movements, and environmental parameters are monitored. The dataset also features resting-state cortical activity and sensory-evoked responses, enhancing its utility for both learning-related and sensory-driven neural dynamics studies. Data are formatted in accordance with the Neurodata Without Borders (NWB) standard, ensuring compatibility with existing analysis tools and adherence to the FAIR principles. This resource enables in-depth investigations into the neural mechanisms underlying behavior and learning. The platform encourages collaborative research, supporting the exploration of rapid within-session learning effects, long-term behavioral adaptations, and neural circuit dynamics.
Merida, I.; Jung, J.; Bouvard, S.; Le Bars, D.; Lancelot, S.; Lavenne, F.; Bouillot, C.; Redoute, J.; Hammers, A.; Costes, N.
We present a database of cerebral PET FDG and anatomical MRI for 37 normal adult human subjects (CERMEP-IDB-MRXFDG). Thirty-nine participants underwent [18F]FDG PET/CT and MRI, resulting in [18F]FDG PET, T1 MPRAGE MRI, FLAIR MRI, and CT images. Two participants were excluded after visual quality control. We describe the acquisition parameters and the image processing pipeline, and provide participants' individual demographics (mean age 38 ± 11.5 years, range 23-65, 20 women). Volumetric analysis of the 37 T1 MRIs showed results in line with the literature. A leave-one-out assessment of the 37 FDG images using Statistical Parametric Mapping (SPM) yielded a low number of false positives after exclusion of artefacts. The database is stored in three different formats, following the BIDS common specification: 1) DICOM (data not processed), 2) NIFTI (multimodal images coregistered to PET subject space), 3) NIFTI normalized (images normalized to MNI space). Bona fide researchers can request access to the database via a short form.
Turner, J.; The wwPDB Consortium
The Electron Microscopy Data Bank (EMDB) is the archive of three-dimensional electron microscopy (3DEM) maps of biological specimens. As of 2021, EMDB has been managed by the Worldwide Protein Data Bank (wwPDB) as a wwPDB Core Archive. Today, the EMDB houses over 29,000 entries with maps containing cells, organelles, viruses, complexes and macromolecules. Herein, we provide an overview of the rapidly growing EMDB archive, including its current holdings, recent updates, and future plans.
Niittynen, P.; Kemppinen, J.
We present FennoTraits, a dataset of plant functional trait and community composition data that we collected in Fennoscandia, across northern Finland, Norway, and Sweden, in 2016-2025. This dataset has 42 049 abundance estimations and 155 794 functional trait observations from 10 traits representing 373 vascular plant species collected from 1 235 study sites within seven study areas. The trait measurements consist of size-structural, leaf economic, leaf spectral, and reproductive traits. The species represent the majority of the native vascular plant species that occur at the seven study areas, and many of the species occur in all seven areas across the two biomes and their ecotone: tundra and boreal forests. Each study area has distinct characteristics and a range of habitats: tundra, meadows, wetlands, shrublands, and boreal forests. These areas are under low anthropogenic influence, and many of the sites are within protected areas that are reserved for nature conservation and scientific research. Finally, alongside this dataset we provide a general description of the main trait patterns and profiles of the northern European flora.
Wang, M.-Y.; Korbmacher, M.; Eikeland, R.; Specht, K.
Populational brain imaging methods based on group averages provide valuable insights into the general functions of the brain. However, they often overlook the inherent inter- and intra-subject variability, limiting our understanding of individual differences. To address this limitation, researchers have turned to big datasets and deep brain imaging datasets. Big datasets enable the exploration of inter-subject variations, while deep brain imaging datasets, involving repeated scanning of multiple subjects over time, offer detailed insights into intra-subject variability. Despite the availability of numerous big datasets, the number of deep brain imaging datasets remains limited. In this article, we present a deep brain imaging dataset derived from the Bergen Breakfast Scanning Club (BBSC) project. The dataset comprises data collected from three subjects who underwent repeated scanning over the course of approximately one year. Specifically, three types of data chunks were collected: behavioral data, functional brain data, and structural brain data. Functional brain images, encompassing magnetic resonance spectroscopy (MRS) and resting-state functional magnetic resonance imaging (fMRI), along with their anatomical reference T1-weighted brain images, were collected twice a week during the data collection period. In total, 38, 40, and 25 sessions of functional data were acquired for subjects 1, 2, and 3, respectively. On the other hand, structural brain images, including T2-weighted brain images, diffusion-weighted images (DWI), and fluid-attenuated inversion recovery (FLAIR) images, were obtained once a month. A total of 10, 9, and 6 sessions were collected for subjects 1, 2, and 3, respectively. The primary objective of this article is to provide a comprehensive description of the data acquisition protocol employed in the BBSC project, as well as detailed insights into the preprocessing steps applied to the acquired data.
Perez, T. M.; Kullberg, A.; Rehm, E.; Feeley, K.
Motivation: Plant heat tolerance data are increasingly valued for their potential to help increase our understanding of species responses to extreme temperatures, but these efforts are hindered by methodological inconsistencies and missing contextual information. To address this issue, we collated data that compiles heat tolerance estimates and documents key sources of variation attributable to taxonomy, methodology, geography, and cultivation to improve data clarity and usability. This resource is designed to catalyze more rigorous and ecologically meaningful syntheses by enabling researchers to identify, account for, and test the drivers of variation in plant heat tolerances and their consequences. Main types of variable contained: Heat tolerance estimated in degrees Celsius from photosynthetic tissue. Spatial location and grain: Global in scope with undersaturated taxonomic sampling and underrepresented geographic regions. Time period and grain: 1935-2024. Major taxa and level of measurement: Primarily vascular plants encompassing >1700 taxa, >1000 genera and >200 families. Software format: Comma-separated values.
Piorkowska, N. J.; Mazurek, R.; Adamek, D.; Lopianiak, M.; Kubs, M.; Kulka, K.; Luszczek, P.; Mijal, R.
Behavioural datasets for invertebrate model organisms are rapidly expanding alongside automated imaging, tracking, and artificial intelligence (AI) based phenotyping, yet their technical structure and compliance with the Findable, Accessible, Interoperable and Reusable (FAIR) principles remain heterogeneous. We present a two-stage survey of openly available behavioural datasets for the major invertebrate models Caenorhabditis elegans (C. elegans), Drosophila melanogaster (D. melanogaster), Galleria mellonella (G. mellonella), and the planarian Schmidtea mediterranea (S. mediterranea), with larval zebrafish (Danio rerio) included as a vertebrate comparator. Stage 1 comprised a PRISMA-guided literature review (from 2015 to 2025) across indexed databases and complementary non-indexed sources, yielding 12 eligible publications describing 12 open behavioural datasets. Stage 2 independently screened and technically evaluated repository deposits (from June 2022 to July 2025), producing a final corpus of 20 datasets scored on a four-dimension ordinal rubric capturing usability, annotation richness, technical quality and AI-readiness. All extracted descriptors, repository search logs, and scoring sheets are released as public data records enabling full regeneration of figures and summary statistics. Across Stage 2 deposits, multimodality and open file formats were common, whereas interoperability and AI-readiness were most constrained by limited machine-readable metadata, weak raw-to-derived provenance, and sparse adoption of formal standards or ontologies. This Data Descriptor provides a reproducible, dataset-centred overview of behavioural resources for invertebrate models and practical guidance for FAIR-aligned publication, secondary biological analyses, and AI benchmarking.
Xiao, Y.; Lau, J. C.; Peters, T. M.; Khan, A. R.
Population-averaged brain atlases, which are represented in a standard space with anatomical labels, are instrumental tools in neurosurgical planning and the study of neurodegenerative conditions. Traditional brain atlases are primarily derived from anatomical scans and contain limited information regarding the axonal organization of the white matter. With the advance of diffusion MRI that allows the modelling of fiber orientation distribution (FOD) in the brain tissue, there is an increasing interest in a population-averaged FOD template, especially based on a large healthy aging cohort, to offer structural connectivity information for connectomic surgery and analysis of neurodegeneration. The dataset described in this article contains a set of multi-contrast structural connectomic MRI atlases, including T1w, T2w, and FOD templates, along with the associated whole brain tractograms. The templates were made using multi-contrast group-wise registration based on 3T MRIs of 422 Human Connectome Project in Aging (HCP-A) subjects. To enhance the usability, probabilistic tissue maps and segmentation of 22 subcortical structures are provided. Finally, the subthalamic nucleus shown in the atlas is parcellated into sensorimotor, limbic, and associative sub-regions based on their structural connectivity to facilitate the analysis and planning of deep brain stimulation procedures. The dataset is available on the OSF Repository: https://osf.io/p7syt. Value of the data: These publicly available templates were created using 422 HCP-Aging subjects, representing averaged anatomical and structural connectivity features of the healthy aging brain. Matching T1w, T2w, and fiber orientation distribution (FOD) templates are provided, together with the associated tractograms that contain 20K and 2 million streamlines. Segmentation of 22 subcortical structures and probabilistic tissue maps are provided to enhance the usability of the templates. Parcellation of the subthalamic nucleus, based on structural connectivity of the included cohort at 0.3 x 0.3 x 0.3 mm³ resolution, is provided.
Zhylko, D.; Del Gallego, R.; Pardo, S.; Mahmoud, R.; Hsieh, Y. T.; Selim, S.; Nogueira, D.; El-Khatib, I.; Lawrenz, B.; Fatemi, H. M.; Shamout, F. E.
In this report, we present Version 1.0 of the Assisted Reproductive Technology (ART) Dataset, a multi-modal fertility dataset from treatments performed at the ART Fertility Clinic in Abu Dhabi, United Arab Emirates, between 2015 and 2022. The data consists of Electronic Health Records (EHR) and embryo development image sequences captured with the Vitrolife EmbryoScope time-lapse system, providing detailed treatment, morphology, and pregnancy outcome information. The final processed dataset consists of a total of 14,776 embryos from 1,810 patients across 2,500 treatments. This dataset will be used in the development of machine learning models for automated analysis of embryo development and viability, to assist clinical decision-making. This report provides a summary of the statistics of the dataset, as well as the extraction and pre-processing pipelines of the time-lapse images and EHR data. The dataset is private, so we publish this report for transparency on data pre-processing pipelines to share the methodology with similar studies that may arise.
Berzaghi, F.; Bretagnolle, F.; Ratshikombo, Z.; Ben Abdallah, A.
Plant nutritional properties (crude protein, fibers, minerals, and carbohydrates) and chemical defenses (tannins and phenols) are key traits determining food quality and feeding preferences of animals and humans. Plant nutritional properties are also relevant to crop production and livestock. Here we present PNuts, a global database containing > 1000 species and > 13,000 records of nutritional properties of different plant organs complete with location and time of collection (year/month/season). Species include crops and wild plants and are classified in six functional groups: legumes, herbs, grasses, lianas, shrubs and trees. Plant organs include leaf, fruit, seed, stem, twigs, flower, root, and bark. PNuts data can be used as inputs for ecological analyses and model parametrization requiring large amounts of data. PNuts provides an important tool to better understand the importance of nutritional properties in plant eco-physiology and the implications for human and animal food quality and plant-animal interactions in the context of global change.
Myers, P. E.; Arvapalli, G. C.; Ramachandran, S. C.; Pisner, D. A.; Frank, P. F.; Lemmer, A. D.; Bridgeford, E. W.; Nikolaidis, A.; Vogelstein, J. T.
Using brain atlases to localize regions of interest is a requirement for making neuroscientifically valid statistical inferences. These atlases, represented in volumetric or surface coordinate spaces, can describe brain topology from a variety of perspectives. Although many human brain atlases have circulated in the field over the past fifty years, limited effort has been devoted to their standardization. Standardization can facilitate consistency and transparency with respect to orientation, resolution, labeling scheme, file storage format, and coordinate space designation. Our group has worked to consolidate an extensive selection of popular human brain atlases into a single, curated, open-source library, where they are stored following a standardized protocol with accompanying metadata, which can serve as the basis for future atlases. The repository containing the atlases, the specification, and the relevant transformation functions is available at https://github.com/neurodata/neuroparc.
Gassert, F.; Stela, B.; Martinez, E. P.; Harfoot, M.
Monitoring, halting and reversing land conversion is fundamental to meeting international biodiversity and climate targets, and agriculture is the major driver of land conversion. We present an open access set of global data for calculating land use change impacts of agricultural supply chains. These data, originally prepared for the LandGriffon service, include indicators of deforestation, conversion of natural ecosystems, greenhouse gas emissions, and loss of intact or high integrity ecosystems following international standards and guidelines for reporting and target setting in the agriculture, forestry, and land use sector. In order to assign impacts to agricultural production, we prepare data using a spatial adaptation of the statistical Land Use Change (sLUC) accounting approach distributing impact to human activities across the local area using a 50km radius. The results are high resolution global maps of impact per hectare of land occupation. These can then be combined with land footprint data, cropland extent, or productivity maps to calculate land use change related impacts for specific crop volumes sourced from specific regions. Carbon and deforestation results are validated against FAO statistics at the national level.